Aligning Clattses in Parallel Texts

نویسندگان

  • Sotiris Boutsis
  • Stelios Piperidis
چکیده

This paper describes a method for the automatic alignment of parallel texts at clause level. The method features statistical techniques coupled with shallow linguistic processing. It presupposes a parallel bilingual corpus and identifies alignments between the clauses of the source and target language sides of the corpus. Parallel texts are first statistically aligned at sentence level and then tagged with their part-of-speech categories. Regular grammars functioning on tags, recognize clauses on both sides of the parallel text. A probabilistic model is applied next, operating on the basis of word occurrence and co-occurrence probabilities and character lengths. Depending on sentence size, possible alignments arc fed into a dynamic progranuning framework or a simulated annealing system in order to find or approxim~te the best alignment. 1he method has been tested on a Small Eng~ lish-Greek corpus consisting of texts relevant to software systems and has produced promising results in terms of correctly identified clause alignments.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Aligning Turkish and English Parallel Texts for Statistical Machine Translation

This paper presents a preliminary work on aligning Turkish and English parallel texts towards developing a statistical machine translation system for English and Turkish. To avoid the data sparseness problem and to uncover relations between sublexical components of words such as morphemes, we have converted our parallel texts to a morphemic representation and then used standard word alignment a...

متن کامل

Sentence Alignment of Brazilian Portuguese and English Parallel Texts

Parallel texts – texts in one language and their translations to other languages – are becoming more and more available nowadays on the Web. Aligning these texts means to find some correspondence between them, in sentence level, for instance. In this paper we describe some experiments done with Brazilian Portuguese and English parallel texts using five well known sentence alignment methods. The...

متن کامل

A Program for Aligning Sentences in Bilingual Corpora

Researchers in both machine Iranslation (e.g., Brown et al., 1990) and bilingual lexicography (e.g., Klavans and Tzoukermann, 1990) have recently become interested in studying parallel texts, texts such as the Canadian Hansards (parliamentary proceedings) which are available in multiple languages (French and English). This paper describes a method for aligning sentences in these parallel texts,...

متن کامل

Char_align: A Program for Aligning Parallel Texts at the Character Level

There have been a number of recent papers on aligning parallel texts at the sentence level, e.g., Brown et al (1991), Gale and Church (to appear), Isabelle (1992), Kay and Ro . . senschein (to appear), Simard et al (1992), Warwick− Armstrong and Russell (1990). On clean inputs, such as the Canadian Hansards, these methods have been very successful (at least 96% correct by sentence). Unfortunate...

متن کامل

Aligning Noisy Parallel Corpora Across Language Groups : Word Pair Feature Matching by Dynamic Time Warping

We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998